Distributed Machine Learning Algorithms to Classify Protein Secondary Structures for Drug Design – A Survey
Leo Dencelin X1*, Ramkumar T2
1Research Scholar, Department of Computer Science & Engineering, Sathayabama University, Chennai, India.
2Associate Professor, School of Information Technology & Engineering, VIT University, Vellore, India.
*Corresponding Author E-mail: ldency6@gmail.com
ABSTRACT:
Secondary structure based drug design has evolved rapidly in recent years. Structures help us understand how a protein functions and support the design of chemicals that can be used pharmaceutically as modifiers of enzyme activity. X-ray structures usually provide a fairly static picture of the protein and are expensive and time consuming to obtain. Modern drug design and development based on proteomics knowledge must therefore rely on computational intelligence, in particular machine learning models built with efficient structure prediction techniques. In recent years, machine learning, building on parallel and distributed computing for handling big data, has made huge advances in many areas. In this paper, we present a comprehensive review of protein structure prediction methods that use machine learning approaches in a distributed environment. Overall, the results are good and show that accuracy and performance in protein secondary structure prediction can be achieved with machine learning techniques, and that these become a powerful aid when implemented in a distributed environment such as Hadoop or Spark. This survey will be helpful to medical researchers, as it aids in understanding the relation between protein sequence, structure and function, and thereby supports the development of drugs and the design of novel enzymes, which is one of the major focus areas in current bioinformatics research.
KEYWORDS: Drug development, Protein Secondary Structure Prediction, Machine Learning, Distributed Computing, Apache Hadoop, Apache Spark, Big data.
INTRODUCTION:
Computational intelligence methods applied to the early stages of drug development have evolved in recent years and have now become a standard approach in almost all phases of the pharmaceutical field. Supervised machine learning (ML) approaches have attracted particular interest, because sophisticated techniques can be applied in the majority of the steps of the drug research and development methodology, such as identification of the structure of the target sequence, prediction of biological activity via model construction, and construction of models that identify the toxicological and pharmacokinetic profile of compounds. One of the major roles of structural biology is to provide the three-dimensional structures of proteins and protein complexes and to identify insights about their structure, and in turn their functions, for efficient drug discovery. Determining protein secondary structures experimentally takes a long time and the process is very slow, so the generation of potential drug targets may be delayed. Clarity in identifying the protein secondary structure and other super-secondary structures (motifs) leads to a better understanding of various human diseases and thereby to the development of new drugs and novel enzymes in the coming years.
Proteins are complex molecules composed of long strings of twenty different types of amino acid. The length of the string and the order of amino acids are vitally important for the protein to function properly in its biological role. The gene encoding the protein determines these factors. A single mistake in the gene may cause the wrong amino acid to be incorporated into the sequence, or a nonsense mutation may cause the protein to be truncated. However, protein function is more directly determined by the protein's three-dimensional shape (the protein structure) and the availability of non-protein cofactors. Understanding the relationship between protein sequence, structure and, in turn, function is one of the greatest challenges in bioinformatics, and it is well established that the primary sequence holds enough information to determine the three-dimensional structure of a protein. At the same time, predicting the structure of a protein from its primary sequence is an extremely complex process, which makes it difficult to understand protein function and to apply this knowledge to drug design and development. Several research reports have shown that a strong correlation exists between structure and protein function1,2. Moreover, proteins with similar sequences tend to have similar structures. Understanding protein structure is therefore considered a critical step in elucidating protein function and is of fundamental importance in proteomics, including protein function analysis, protein engineering, genome annotation, protein design, and drug design and development. Thus, the prediction of protein secondary structure is considered a critical area in the proteomics field. Amino acid sequence data from the Protein Data Bank (PDB) and other data banks has been processed, and experimental methods such as X-ray crystallography and Nuclear Magnetic Resonance (NMR) have been used to identify secondary structures. These methods remain too costly and time consuming, which has led to research on machine learning based techniques as an alternative approach, best combined with other computational techniques in order to obtain a detailed picture of how the protein functions.
Computational intelligence tools such as machine learning algorithms have been used in bioinformatics to extract insights and significance from protein data. With the accumulation of experimentally validated protein structural data, researchers are focusing on computational algorithms to predict protein structures3,4,5. Although various prediction methods have been widely adopted in industry, the research is still at an early stage: a lack of domain expertise in proteomics combined with limited computational knowledge leads to inefficient prediction results with low accuracy6. Moreover, the structure prediction approach can be carried out much faster if the entire investigation is combined with a distributed computing model using an advanced machine learning approach.
The literature covered in this survey focuses on applications of computational techniques, especially machine learning approaches, applied with the goal of extracting features from the amino acid sequences of proteins using distributed computing. This review paper focuses on three aspects: 1) examining the properties of amino acids and the features involved in prediction in order to attain good accuracy; 2) reviewing research on the use of machine learning techniques for predicting protein secondary structures for drug discovery; and 3) identifying a distributed framework that helps improve the performance of structure prediction methods. Accordingly, this review is divided into three main sections. The first section explains the properties of amino acids that help in identifying the secondary structure and the various features used. The second section focuses on papers about machine learning techniques applied so far for predicting secondary structures. The last section discusses papers on distributed frameworks that improve the performance of structure prediction techniques. As an overall survey of existing work, we believe that this paper provides valuable insights and will be of help to researchers applying machine learning in their bioinformatics research, thereby achieving good accuracy and performance in protein structure prediction research.
Proteins form substructures such as hair, skin and tendon; serve as receptors, hormones, storage and defence molecules, and enzymes; act as conveyors of signals in the body; catalyse and regulate biochemical reactions; and transport molecules. They are composed of simple building blocks called amino acids and are essential for living organisms. Secondary structure refers to the local ordering of amino acids into elements such as β-strands, α-helices, coils and sometimes motifs. The information available in the primary sequence enables the protein to fold into a unique shape. Proteins can fold into more than one spatial conformation, which helps them perform their biological function more efficiently. The unique linear sequence of amino acids, called the polypeptide, is responsible for structure formation. There are 20 different kinds of amino acids (Table 1), and each amino acid is distinguished by its own side chain, which also determines the amino acid's properties. The four groups of amino acids are acidic, basic, polar and non-polar; they can also be divided into hydrophilic, which are attracted to water, and hydrophobic, which are repelled by water. The combination of physico-chemical properties that allows a particular protein to fold into a particular structure is not fully known, and many of the intrinsic properties involved in determining the secondary structure of an amino acid sequence remain unknown. The R group (side chain), which differs from one amino acid to another, is considered the most important distinguishing feature of amino acids. A few other factors, including the requirement that the energy of the secondary structure be low and stable, and the links between amino acids in the sequence, help determine the specific structure of a protein. A protein exhibits biological activity only when it folds into a 3D structure. The goal is to identify the shape and structure (fold) that a given primary sequence will form. Each amino acid will "fit" best in one type of structure rather than another, based on the shape, charge and size of its side chain.
Table 1: Names and Symbols of 20 Amino Acids

| Amino acid | Three letter code | One letter code |
| Alanine | ala | A |
| Arginine | arg | R |
| Asparagine | asn | N |
| Aspartic acid | asp | D |
| Asparagine or aspartic acid | asx | B |
| Cysteine | cys | C |
| Glutamic acid | glu | E |
| Glutamine | gln | Q |
| Glutamine or glutamic acid | glx | Z |
| Glycine | gly | G |
| Histidine | his | H |
| Isoleucine | ile | I |
| Leucine | leu | L |
| Lysine | lys | K |
| Methionine | met | M |
| Phenylalanine | phe | F |
| Proline | pro | P |
| Serine | ser | S |
| Threonine | thr | T |
| Tryptophan | trp | W |
| Tyrosine | tyr | Y |
| Valine | val | V |
In this review paper, ab initio approaches to protein secondary structure prediction, based on the protein primary structure and its physicochemical properties7, were investigated. Various features extracted from the protein sequence, such as PSSM profiles8, the amino acid sequence itself, and physicochemical properties like hydrophobicity, polarity, polarizability, secondary structure, solvent accessibility, and binding and non-binding propensity, have been used in virtually all of this research.
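To make the feature extraction step concrete, the following is a minimal sketch of one common way to encode residues for a classifier: a sliding window of one-hot vectors plus a hydrophobicity value per position. The window size, the padding choice and the example sequence are assumptions for illustration, not the exact encoding used in the surveyed papers.

```python
# Hypothetical sketch: sliding-window feature encoding for one protein sequence.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Kyte-Doolittle hydrophobicity scale (published values).
KD_HYDROPHOBICITY = {
    'A': 1.8, 'C': 2.5, 'D': -3.5, 'E': -3.5, 'F': 2.8, 'G': -0.4, 'H': -3.2,
    'I': 4.5, 'K': -3.9, 'L': 3.8, 'M': 1.9, 'N': -3.5, 'P': -1.6, 'Q': -3.5,
    'R': -4.5, 'S': -0.8, 'T': -0.7, 'V': 4.2, 'W': -0.9, 'Y': -1.3,
}

def encode_window(sequence, center, window=13):
    """One-hot + hydrophobicity features for a window centred on one residue."""
    half = window // 2
    features = []
    for pos in range(center - half, center + half + 1):
        one_hot = [0.0] * len(AMINO_ACIDS)
        hydro = 0.0
        if 0 <= pos < len(sequence) and sequence[pos] in KD_HYDROPHOBICITY:
            one_hot[AMINO_ACIDS.index(sequence[pos])] = 1.0
            hydro = KD_HYDROPHOBICITY[sequence[pos]]
        features.extend(one_hot + [hydro])   # positions outside the sequence stay zero-padded
    return np.array(features)

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # placeholder sequence
X = np.vstack([encode_window(seq, i) for i in range(len(seq))])
print(X.shape)   # (len(seq), window * 21) feature matrix for a classifier
```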
Machine learning algorithms start with a training phase to identify patterns, followed by building a model, and then use the model to make predictions on test data. In proteomics, these algorithms are extensively applied across biology and drug discovery. A few of the important machine learning algorithms used for sequence data are the Hidden Markov Model (HMM), Support Vector Machine (SVM), Gaussian networks, K-Nearest Neighbour (KNN), Naïve Bayes, Bayesian networks, Decision Trees, Artificial Neural Networks (ANN) and ensemble-based approaches such as Random Forest and Gradient Boosted Trees (GBT). These are applied in all areas of biology, including genomics, proteomics, systems biology, and many other domains. Various machine learning algorithms have been widely applied in bioinformatics because of the huge volume and variety of biomedical data accumulated worldwide. Machine learning is used in many stages of the drug discovery pipeline, including QSAR analysis; in addition, scoring functions can be used in structure-based drug discovery to predict target-ligand interactions and binding affinities. Moreover, these algorithms can be trained to distinguish active drugs from decoys that have no known drug activity. To extract knowledge from the massive data in bioinformatics, computational methods can provide a first step in protein structure determination, and sequence-based methods are routinely used to help characterize protein structure. This section provides a brief review of the machine learning approaches used for predicting secondary structures.
The Hidden Markov Model (HMM) is a probabilistic model in which the states are "hidden", that is, not directly observed; it can be viewed as the simplest dynamic Bayesian network. The algorithm is mainly applied to sequence data such as protein sequences, gene data and voice data, and to NLP tasks such as speech and handwriting recognition. In a Markov model of sequence data, the value at position n depends mainly on the preceding k characters, where k is the order of the Markov chain9,10. Given the past of the sequence in a window of size k, the Markov model is defined by the set of probabilities of each character in the sequence. The transition matrix can change along the sequence in the Markov chain; this hidden process, the choice of the transition matrix, is itself governed by another Markovian process. Hidden Markov models represent sequence heterogeneity and can be used in predictive approaches; algorithms such as the forward-backward procedure and the Viterbi algorithm allow the transition matrix to be estimated and used together with the observed sequence.
Fig. 1: A simple hidden Markov model with 3 states of protein secondary structure
A simple HMM with three hidden states, each with three class probabilities, is depicted in Fig. 1. The top line represents the secondary structure of the protein sequence, labelled H for α-helix, B for β-strand and C for coil. The arrows in the top sequence show the first-order dependency of the hidden Markov process. The lower line is the observed sequence, which represents the amino acid sequence of the protein. The downward arrows represent the dependency between the observed sequence and the hidden chain. The forward/backward algorithm can be used to recover the hidden process from the observed sequence.
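As a minimal sketch of how the most likely hidden state path is recovered, the following implements Viterbi decoding for a 3-state (H/E/C) HMM. The start, transition and emission probabilities are illustrative placeholders, not values estimated from real protein data.

```python
# Minimal Viterbi decoding sketch for a 3-state hidden Markov model.
import numpy as np

states = ['H', 'E', 'C']                      # helix, strand, coil
obs_symbols = list("ACDEFGHIKLMNPQRSTVWY")

rng = np.random.default_rng(0)
start_p = np.array([0.3, 0.2, 0.5])
trans_p = np.array([[0.80, 0.05, 0.15],       # rows: from-state, cols: to-state
                    [0.05, 0.75, 0.20],
                    [0.15, 0.15, 0.70]])
emit_p = rng.dirichlet(np.ones(len(obs_symbols)), size=len(states))  # placeholder emissions

def viterbi(sequence):
    obs = [obs_symbols.index(a) for a in sequence]
    n, m = len(obs), len(states)
    log_delta = np.zeros((n, m))
    back = np.zeros((n, m), dtype=int)
    log_delta[0] = np.log(start_p) + np.log(emit_p[:, obs[0]])
    for t in range(1, n):
        for j in range(m):
            scores = log_delta[t - 1] + np.log(trans_p[:, j])
            back[t, j] = np.argmax(scores)
            log_delta[t, j] = scores[back[t, j]] + np.log(emit_p[j, obs[t]])
    path = [int(np.argmax(log_delta[-1]))]
    for t in range(n - 1, 0, -1):
        path.append(back[t, path[-1]])        # follow backpointers to recover the path
    return ''.join(states[s] for s in reversed(path))

print(viterbi("MKTAYIAKQRQISFVK"))            # most likely hidden H/E/C state path
```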
Support Vector Machines (SVMs), which handle diverse data types well, offer high accuracy and deal flexibly with high-dimensional data, are widely used in bioinformatics. The SVM is a supervised machine learning algorithm based on statistical learning theory. The idea behind the learning algorithm can be stated in the following steps. The first step is to map the input vectors into a feature space, either linearly or non-linearly, possibly of higher dimension, which requires the selection of a kernel function. The next step is to construct a hyperplane inside the feature space that separates the three classes from the first step by seeking an optimal linear division. The three classes are α-helix, β-strand and coil, the three secondary structure states. SVMs can be used for both regression and classification tasks; they are based on the construction of hyperplanes and are associated with learning algorithms. With a kernel function they can handle patterns that are not linearly separable11,12, which makes them well suited to the protein structure prediction problem. SVMs can also be used to construct classifiers that distinguish between parallel and anti-parallel beta sheets. These algorithms have high predictive power and are often used for the classification of biological data.
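A hedged sketch of this setup is shown below: a one-vs-rest RBF-kernel SVM over window-encoded residue features. The synthetic data stands in for real PSSM/window features, and the labels 0/1/2 correspond to helix/strand/coil.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(600, 273))          # e.g. 13-residue window x 21 features per position
y = rng.integers(0, 3, size=600)         # placeholder H/E/C labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = SVC(kernel='rbf', C=1.0, gamma='scale', decision_function_shape='ovr')
clf.fit(X_tr, y_tr)
print("accuracy on held-out windows:", clf.score(X_te, y_te))
```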
The Artificial Neural Network (ANN) is a biologically inspired learning algorithm with the ability to learn complex functions from large data sets, and it can be used for protein structure prediction with improved accuracy. It is mainly used to predict the number of structural domains from structure and the secondary structure from the input amino acid sequence. These algorithms also automatically extract features from the input data, which is one of the major factors deciding accuracy13,14. With different combinations of network parameters they can be applied to predict the tertiary and quaternary structures of proteins, which helps in understanding novel drug design and enzyme therapy, although this is still at an early stage of research. They can also be extended to predicting the bio-toxicity of molecules. A more promising route to predicting protein secondary structure is to explore different network parameters with forward and backpropagation approaches and to increase the number of nodes.
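The following is a minimal sketch of such a feed-forward network (trained by backpropagation) for 3-class secondary structure prediction; the layer sizes and the synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 273))                  # window-encoded residue features
y = rng.integers(0, 3, size=600)                 # placeholder H/E/C labels

net = MLPClassifier(hidden_layer_sizes=(128, 64),  # two hidden layers (assumed sizes)
                    activation='relu',
                    max_iter=300,
                    random_state=0)
net.fit(X, y)
print(net.predict(X[:5]))                        # predicted classes for the first 5 windows
```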
The Decision Tree is a tree-based supervised learning algorithm that is widely used in domains such as banking, retail, healthcare and bioinformatics. It is mainly adopted because of its good interpretability and simplicity, and was introduced by Leo Breiman et al. in 1984. In a decision tree algorithm, the values of the independent (input) attributes are given and the value of a dependent (output) attribute is predicted; the tree is ideally used as a classification tree to learn a classification function. One could try to learn a complex tree that best fits the training data15. During tree construction, some trees do not generalize well because they overfit the training data; tree pruning and other techniques can be used to solve this problem. Several studies report that more advanced classification methods such as artificial neural networks or support vector machines achieve better accuracy than a single decision tree, which limits its usage in accuracy-critical domains. Combining several decision trees, called an ensemble of trees, improves accuracy over single decision trees and can outperform other supervised machine learning algorithms.
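A hedged sketch of a single pruned decision tree is given below; the pre-pruning depth limit, the cost-complexity parameter and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(600, 273))                  # placeholder window features
y = rng.integers(0, 3, size=600)                 # placeholder H/E/C labels

tree = DecisionTreeClassifier(criterion='gini',
                              max_depth=10,      # pre-pruning: cap tree depth
                              ccp_alpha=0.001,   # post-pruning: cost-complexity pruning
                              random_state=0)
tree.fit(X, y)
print("tree depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```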
The ensemble approach is an effective technique in which multiple learning algorithms are combined to improve overall prediction accuracy. These techniques combine or average more than one machine learning model to reduce the potential for overfitting the training data, which also helps alleviate the small sample size problem. In this way the training dataset is used more efficiently, which is important for structure prediction problems with small sample sizes. Typically, an ensemble model is a supervised learning technique that combines multiple weak learners or models to produce a strong learner, using concepts such as bagging and boosting for data sampling, and it increases accuracy on a variety of ML tasks. Many types of ensemble methods have been applied to data analysis in biology. The three most popular ensemble methods, used in almost all areas, are bagging, boosting and random forests16,17,18, depicted in Fig. 2.
Bagging (Bootstrap Aggregation) is mainly used to decrease the variance of the prediction by creating additional training data from the original dataset, sampling with replacement to generate multiple sets of the same size as the original data. Increasing the training dataset size in this way does not improve the model's predictive power; rather, it reduces the variance of the prediction19.
Boosting20 is a two-step process: it first uses subsets of the original data to produce a series of moderately performing models and then "boosts" their performance by combining them, for example using a majority vote.
Random Forest is similar to bagging but additionally enforces diversity among the base classifiers in the ensemble21,22. The advantages of these ensemble methods are that they usually have a lower classification error and are generally faster than SVMs; they give a general idea of the features used and of the most important, highly influential features; and they handle missing data easily. Another tree ensemble is Gradient Boosted Trees (GBT), which also shows promising results on classification problems.
Fig. 2: Classification using Bagging, Boosting, and Random Forests
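The following is a hedged comparison sketch of the three ensemble flavours discussed above, bagging, boosting (gradient boosted trees) and random forests, evaluated on the same synthetic window features; the data, estimator counts and cross-validation setup are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(600, 100))                  # placeholder features
y = rng.integers(0, 3, size=600)                 # placeholder H/E/C labels

models = {
    "bagging": BaggingClassifier(n_estimators=50, random_state=0),
    "boosting": GradientBoostingClassifier(n_estimators=100, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=3)  # 3-fold cross-validated accuracy
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```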
HPC systems are mainly used for the fast processing of huge data volumes. This is primarily achieved with an efficient distributed environment in which more than one independent computer processes a large volume and variety of data23. Big Data is currently one of the hottest topics in computer science, due to the rapid increase in the amount of data we, as a society, produce and store. The driving force behind this increase is a dramatic drop in the cost of collecting and storing data, observed not only for social data and the Internet of Things (IoT) but also in proteomics. As a result, the amount of proteomics data is growing faster than the capacity to perform computations on it. At the same time we would like to obtain results faster, for example to start a treatment sooner rather than later. To obtain maximum value from this data we need a system that can scale as the data grows.
Hadoop is a distributed framework for parallel processing that follows the Map-Reduce programming model. Map-Reduce has been applied by many successful cloud providers, such as Amazon EC2, Yahoo, Google and IBM, and is mainly used for large-scale data processing24,25. The two fundamental steps are the Map phase and the Reduce phase, depicted in Fig. 3. Input and output data are expressed as key/value pairs26. The Map phase takes a series of key/value pairs and generates processed key/value pairs, which are then passed to the relevant reducer function. Before the reducer begins, the data is sorted and shuffled; finally, the Reduce phase iterates through the values associated with each key and generates the output.
Fig. 3: The procedure of the Hadoop Map/Reduce model
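To illustrate the key/value flow in Fig. 3, the following is a toy, single-machine simulation of the map, shuffle and reduce phases counting residues across a few sequences. Real Hadoop jobs distribute these phases across cluster nodes; the sequences here are placeholders.

```python
from collections import defaultdict

sequences = ["MKTAYIAKQR", "GAVLIMCFYW", "AAAKKKEEE"]   # placeholder input records

def mapper(sequence):
    for residue in sequence:
        yield residue, 1                 # emit (key, value) pairs

def shuffle(mapped_pairs):
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)       # group values by key before reducing
    return grouped

def reducer(key, values):
    return key, sum(values)              # aggregate all values for one key

mapped = [pair for seq in sequences for pair in mapper(seq)]
results = dict(reducer(k, v) for k, v in shuffle(mapped).items())
print(results)                           # residue counts across all sequences
```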
The execution time of the algorithm on the Hadoop framework has been investigated for different dataset sizes and for varying numbers of Map/Reduce operations27,28. The number of map operations has a direct impact on the execution time and can effectively reduce it. The inference is that the distributed computing framework Hadoop significantly reduces the computational cost of this structure prediction approach. These studies show that a machine learning approach implemented in the Map-Reduce framework provides a better solution for the protein structure prediction problem than an implementation on a single sequential computational node.
Spark is a general-purpose, open-source cluster computing framework built on top of the Hadoop Map-Reduce framework; it extends the MR model for efficient computation with real-time interactive queries and batch/online stream processing. The Resilient Distributed Dataset (RDD)29 is the core abstraction behind Spark: an immutable collection of objects, partitioned and distributed across multiple nodes of a Hadoop cluster operated in distributed mode. Spark provides an ideal way to organize large proteomics analysis pipelines and workflows, and its compatibility with the Hadoop platform allows easy deployment and support within existing bioinformatics applications. It also supports several languages such as R, Python, Java and Scala, and query-based processing with Spark SQL, which is convenient for practising researchers in the bioinformatics field. The APIs may require some customization, rewriting many of the built-in methods, tools and algorithms to cope with proteomics data30. Spark has also been used on genomic data together with ADAM, with very promising results. Most computations on protein data are naturally parallelizable, so pipelines can often be used to achieve parallelism between stages. Spark RDDs and partitioners allow declarative parallelization for proteomics, typically in a small, standard number of ways, by position or by sample31. Spark has its own advantages and a few of them are listed here:
1) The RDD abstraction in Spark stores data in memory and persists it as required. This in-memory capability of RDDs allows us to increase the performance of batch jobs.
2) This caching capability of the Spark framework also efficiently processes iterative algorithms, which is a major requirement for machine learning algorithms (a minimal sketch of this behaviour follows the list below).
3) The stream processing capability of the Spark framework with large input data speeds up processing and is nowadays considered one of the ubiquitous requirements in industry.
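The following minimal PySpark sketch illustrates the RDD ideas above: an RDD is built from protein sequences, cached in memory, and reused by two actions without re-reading the input. The local[*] master and the in-line data are assumptions made so the sketch runs without a cluster.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-cache-sketch")

sequences = sc.parallelize(["MKTAYIAKQR", "GAVLIMCFYW", "AAAKKKEEE"], numSlices=2)
residues = sequences.flatMap(lambda s: list(s)).cache()     # keep partitions in memory

print("total residues:", residues.count())                  # first action materialises the RDD
print("distinct residues:", residues.distinct().count())    # second action reuses the cache

sc.stop()
```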
Our aim is to improve the performance of the approach, along with its accuracy, in predicting protein secondary structures. Spark, as a distributed processing system, will help enhance this performance.
Research Recommendation:
Machine learning methods in a distributed environment show promising results in proteomics, especially in predicting protein secondary structures, and speed up the approach. In this novel area, a few intensive research outcomes, including neural network and ensemble based techniques, have also been reported by us32,33. The amino acid features involved and the feature extraction techniques were also investigated for their effect on prediction accuracy. Fig. 4 depicts the recommended architecture that can be used for predicting secondary structures. The proposed architecture consists of four different layers: 1) dataset ingestion, 2) preprocessing, 3) modelling, and 4) distributed framework.
Fig. 4: Recommended ML approach for predicting the secondary structure of a protein
The datasets are obtained from the Protein Data Bank (PDB) repository and can also be used in FASTA file format. PDB entries are submitted by biologists and biochemists and are obtained using techniques such as X-ray crystallography and NMR spectroscopy, which are considered expensive and time consuming. The PDB contains the 3D structural data of biomolecules such as proteins and nucleic acids.
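As a hedged sketch of the ingestion step, sequences in FASTA format can be read with Biopython; the file name "proteins.fasta" is a placeholder and Biopython must be installed.

```python
from Bio import SeqIO

# Iterate over FASTA records and print accession and sequence length.
for record in SeqIO.parse("proteins.fasta", "fasta"):
    print(record.id, len(record.seq))
```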
The preprocessing of the input data involves the extraction of features from the input PDB dataset. PSSM profiles, the amino acid sequence, secondary structure, hydrophobicity, polarizability, solvent accessibility, polarity, and binding and non-binding propensity have been used in almost all of the research, and the values of these properties are extracted in a format such as a matrix or vector. The Dictionary of Protein Secondary Structure (DSSP) uses the following notation: H for α-helix, G for 3₁₀-helix, I for π-helix, E for β-strand, B for isolated β-bridge, T for turn, S for bend, and '-' for other structures. Our research mainly focuses on three types of secondary structure, so we group the eight DSSP classes into three types: Helix, Sheet and Coil.
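A sketch of one common 8-state-to-3-state reduction of DSSP labels is shown below. Several grouping conventions exist in the literature; the mapping here (H/G/I to Helix, E/B to Sheet, everything else to Coil) is one frequently used assumption.

```python
EIGHT_TO_THREE = {
    'H': 'H', 'G': 'H', 'I': 'H',   # helices
    'E': 'E', 'B': 'E',             # strand / isolated beta-bridge
    'T': 'C', 'S': 'C', '-': 'C',   # turn, bend, other -> coil
}

def reduce_dssp(dssp_string):
    """Map an 8-state DSSP string onto the 3 classes Helix (H), Sheet (E), Coil (C)."""
    return ''.join(EIGHT_TO_THREE.get(s, 'C') for s in dssp_string)

print(reduce_dssp("HHHHTT-EEEEESGGG"))   # -> HHHHCCCEEEEECHHH
```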
In our work, we propose to use any one of the machine learning models, a combination of more than one model, or an ensemble algorithm. The Random Forest algorithm's default feature selection technique, Gini/permutation importance, is used for extracting the top features, and the decision trees are constructed using the Random Forest algorithm. The Random Forest algorithm can be implemented using the randomForest and h2o packages in R. Gini importance or permutation importance is used to rank the extracted features, which plays a major role in the accuracy of the model. Node impurity is a measure of the homogeneity of the labels at a node; the two impurity measures for classification are Gini impurity and entropy. Considerations such as the choice of classifier model, number of trees, maximum depth, subsampling rate and feature subset strategy, tuned at appropriate levels, offer a path to accuracy improvement.
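The text above refers to the randomForest and h2o packages in R; the following is an equivalent hedged sketch in Python using scikit-learn, whose RandomForestClassifier exposes Gini-based feature importances. The data, the number of trees and the feature subset strategy are placeholder assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(7)
X = rng.normal(size=(600, 50))                 # e.g. PSSM + physicochemical features
y = rng.integers(0, 3, size=600)               # H/E/C labels

forest = RandomForestClassifier(n_estimators=300,
                                max_depth=None,
                                max_features='sqrt',   # feature subset strategy per split
                                random_state=0)
forest.fit(X, y)

top = np.argsort(forest.feature_importances_)[::-1][:10]
print("top-10 feature indices by Gini importance:", top)
```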
Apache Spark is used to accelerate the process, and the best model is produced as the classifier output with the predicted secondary structures. The Spark MLlib34 module of Apache Spark provides distributed machine learning algorithms on top of Spark's RDD abstraction. The DataFrame-based machine learning higher-level API under org.apache.spark.ml is used for implementing the classification algorithm with improved parallelism. The prediction algorithm is a machine learning approach implemented in the Apache Spark framework to achieve better performance. A Linux machine with two CPUs, each having 8 cores, and 128 GByte of RAM can be used for this computation. Various other machine learning architectures using the distributed framework Spark can be applied for better prediction results. Our goal is to simplify the development and usage of machine learning algorithms, and our proposed architecture uses Spark as its computing framework.
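A hedged sketch of the DataFrame-based spark.ml API mentioned above is shown below: feature columns are assembled into a vector and a distributed random forest is fitted. The toy rows, column names and tree count are assumptions for illustration, not the configuration used in our experiments.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.appName("pssp-rf-sketch").getOrCreate()

rows = [(0.12, -0.4, 1.0, 0.0), (0.55, 0.9, -1.0, 1.0),
        (-0.3, 0.1, 0.2, 2.0), (0.8, -0.7, 0.4, 1.0)]
df = spark.createDataFrame(rows, ["f1", "f2", "f3", "label"])

# Assemble the feature columns into the single vector column spark.ml expects.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=50)

model = rf.fit(assembler.transform(df))
model.transform(assembler.transform(df)).select("label", "prediction").show()

spark.stop()
```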
Various experiments have been captured in this review, highlighting the efficiency of machine learning techniques combined with traditional approaches for studying pharmaceutical problems. It is the combination of computational approaches, encompassing techniques such as structure prediction together with the interpretation of related experimental data, that provides a comprehensive understanding of the motions of proteins and their assemblies. Information on the latter is crucial when synthesizing improved biomolecules and designing new drugs. Machine learning methods implemented with a parallel processing approach are used to predict protein secondary structures and to speed up the approach. This survey has focused on the various aspects required to improve the performance and accuracy of protein structure prediction techniques. In conclusion, the combination of machine learning with a Spark based protein structure prediction modelling approach shows good results and is being adopted in all stages of drug discovery using protein data. Our future work will focus on improving accuracy and execution time, as the implementation of computational intelligence techniques is becoming a critical area in the proteomics field.
REFERENCES:
1 Luo RY, Feng ZP, Liu JK. Prediction of protein structural class by amino acid and polypeptide composition. European Journal of Biochemistry. 2002;269 (17):4219-25.
2 Qu W, Sui H, Yang B, Qian W. Improving protein secondary structure prediction using a multi-modal BP method. Computers in biology and medicine. 2011;41(10):946-59.
3 Valencia A, Pazos F. Computational methods for the prediction of protein interactions. Current opinion in structural biology. 2002;12 (3):368-73.
4 Jain P, Garibaldi JM, Hirst JD. Supervised machine learning algorithms for protein structure classification. Computational biology and chemistry. 2009;33 (3):216-23.
5 Li ZC, Zhou XB, Lin YR, Zou XY. Prediction of protein structure class by coupling improved genetic algorithm and support vector machine. Amino acids. 2008;35 (3):581-90.
6 Bondugula R, Duzlevski O, Xu D. Profiles and fuzzy K-nearest neighbor algorithm for protein secondary structure prediction. In APBC 2005; pp. 85-94.
7 Toussaint NC, Widmer C, Kohlbacher O, Rätsch G. Exploiting physico-chemical properties in string kernels. BMC bioinformatics. 2010; 11(8):S7.
8 Li D, Li T, Cong P, Xiong W, Sun J. A novel structural position-specific scoring matrix for the prediction of protein secondary structures. Bioinformatics. 2012;28(1):32-9.
9 Won KJ, Hamelryck T, Prügel-Bennett A, Krogh A. An evolutionary method for learning HMM structure: prediction of protein secondary structure. BMC bioinformatics. 2007;8 (1):357.
10 Martin J, Gibrat JF, Rodolphe F. Hidden Markov Model for protein secondary structure. In International Symposium on Applied Stochastic Models and Data Analysis 2005.
11 Johal AK, Singh R. Protein secondary structure prediction using improved support vector machine and neural networks. International Journal of Engineering and Computer Science. 2014;3(1):3593-7.
12 Ceroni A, Frasconi P, Passerini A, Vullo A. A combination of support vector machines and bidirectional recurrent neural networks for protein secondary structure prediction. In Congress of the Italian Association for Artificial Intelligence 2003; pp. 142-153. Springer Berlin Heidelberg.
13 Deka A, Sarma KK. Artificial neural network aided protein structure prediction. Int. J. Comput. Appl. 2012;48(18):33-7.
14 Thalatam MV, Rao PV, Varma KV, Murty NV, Apparao A. Prediction of Protein Secondary Structure using Artificial Neural Network. International Journal on Computer Science and Engineering. 2010;2(5):1615-21.
15 Toca CE, Chamorro AE, Cortés GA, Aguilar-Ruiz JS. A decision tree-based method for protein contact map prediction. In European Conference on Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics 2011; pp. 153-158. Springer Berlin Heidelberg.
16 Bouziane H, Messabih B, Chouarfia A. Profiles and majority voting-based ensemble method for protein secondary structure prediction. Evolutionary bioinformatics online. 2011;7:171.
17 Huang DS, Huang X. Improved performance in protein secondary structure prediction by combining multiple predictions. Protein and peptide letters. 2006;13(10):985-91.
18 Dehzangi A, Phon-Amnuaisuk S, Dehzangi O. Using Random Forest for Protein Fold Prediction Problem: An Empirical Study. J. Inf. Sci. Eng. 2010;26(6):1941-56.
19 Yang P, Yang YH, Zhou BB, Zomaya AY. A review of ensemble methods in bioinformatics. Current Bioinformatics. 2010;5(4):296-308.
20 Jo T, Cheng J. Improving protein fold recognition by random forest. BMC bioinformatics. 2014;15(11):S14.
21 Dietterich TG. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine learning. 2000;40(2):139-57.
22 Bauer E, Kohavi R. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine learning. 1999;36(1):105-39.
23 Zhang C, De Sterck H, Aboulnaga A, Djambazian H, Sladek R. Case study of scientific data processing on a cloud using hadoop. In High performance computing systems and applications 2010; pp. 400-415. Springer Berlin Heidelberg.
24 Zhou C. Fast parallelization of differential evolution algorithm using MapReduce. In Proceedings of the 12th annual conference on Genetic and evolutionary computation 2010;pp. 1113-1114. ACM.
25 Taylor RC. An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC bioinformatics. 2010;11(12):S1.
26 Zhang C, De Sterck H, Aboulnaga A, Djambazian H, Sladek R. Case study of scientific data processing on a cloud using hadoop. In High performance computing systems and applications 2010; pp. 400-415. Springer Berlin Heidelberg.
27 Schatz MC. CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics. 2009;25(11):1363-9.
28 Hung CL, Lin YL. Implementation of a parallel protein structure alignment service on cloud. International journal of genomics. 2013 Mar 25.
29 Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I. Spark: Cluster Computing with Working Sets. HotCloud. 2010;10(10-10):95.
30 Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin MJ, Shenker S, Stoica I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation 2012;pp. 2-2. USENIX Association.
31 Venkataraman S, Yang Z, Liu D, Liang E, Falaki H, Meng X, Xin R, Ghodsi A, Franklin M, Stoica I, Zaharia M. SparkR: Scaling R programs with Spark. In Proceedings of the 2016 International Conference on Management of Data 2016; pp. 1099-1104. ACM.
32 Leo Dencelin X, Ramkumar T. Analysis of multilayer perceptron machine learning approach in classifying protein secondary structures. Biomedical Research. Special section. Computational Life Science and Smarter Technological Advancement. 2016; 166-173.
33 Leo Dencelin X, Ramkumar T. A Distributed Tree-based Ensemble Learning Approach for Efficient Structure Prediction of Protein. International Journal of Intelligent Engineering and Systems.2017;10(3):226-234.
34 Meng X, Bradley J, Yavuz B, Sparks E, Venkataraman S, Liu D, Freeman J, Tsai DB, Amde M, Owen S, Xin D. MLlib: Machine Learning in Apache Spark. Journal of Machine Learning Research. 2016;17(34):1-7.
Received on 07.06.2017 Modified on 07.07.2017
Accepted on 17.07.2017 © RJPT All right reserved
Research J. Pharm. and Tech. 2017; 10(9): 3173-3180.
DOI: 10.5958/0974-360X.2017.00564.9